Week 3 - Data Quality and tidyr

Emorie D Beck

Outline

  1. Welcome & Q’s on homework
  2. Part 1: Data Quality and Descriptives
  3. Part 2: tidyr
  4. Problem set & Q time


DATA QUALITY

What is data quality?

IBM’s definition of data quality:

“Data quality measures how well a dataset meets criteria for accuracy, completeness, validity, consistency, uniqueness, timeliness, and fitness for purpose”

Aspects of data quality

  • Accuracy: Do the data reflect reality / truth?
  • Completeness: Are the data usable and complete (no missing people, values, etc., beyond random missingness)?
  • Uniqueness: Are there no duplicated observations?
  • Validity: Do the data have the correct properties (values, ranges, etc.)?
  • Consistency: When integrating multiple data sources, does the information converge across sources and match reality?
  • Timeliness: Can the data be maintained and distributed within a specified time frame?
  • Fitness for purpose: Do the data meet your research need?

Why should we care about data quality?

  • You aren’t responsible for the poor-quality data you receive, but you are responsible for the data products you create; that is, you are responsible for improving data quality
  • Poor quality data threatens scientific integrity
  • Poor quality data are a pain for you to work with and for others to work with

What can data quality do for my career?

  • The virtuous cycle of data cleaning
    • Some people get a reputation for getting their data, analyses, etc. right
    • This is important for publications, grant funding, etc.
    • It tends to be inter-generational: you inherit some of your reputation on this from your advisor
    • Start paying it forward now to build your own career, whether it’s in academia or industry

How do I ensure data quality?

The Towards Data Science website has a nice definition of EDA:

“Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, to spot anomalies, to test hypotheses and to check assumptions with the help of summary statistics”

  • So EDA is basically a fancy word for the descriptive statistics you’ve been learning about for years

How do I ensure data quality?

I think about “exploratory data analysis for data quality”

  • Investigating values and patterns of variables from “input data”
  • Identifying and cleaning errors or values that need to be changed
  • Creating analysis variables
  • Checking values of analysis variables against values of input variables

How I will teach exploratory data analysis

I will teach exploratory data analysis (EDA) in two sub-sections:

  1. Provide “Guidelines for EDA”
    • Less about coding, more about practices you should follow and mentality necessary to ensure high data quality
  2. Introduce “Tools of EDA”:
    • Demonstrate code to investigate variables and relationship between variables
    • Most of these tools are just the application of programming skills you have already learned (or will learn soon!)

Guidelines for “EDA for data quality”

Assume that your goal in “EDA for data quality” is to investigate “input” data sources and create “analysis variables”

  • Usually, your analysis dataset will incorporate multiple sources of input data, including data you collect (primary data) and/or data collected by others (secondary data)

Guidelines for “EDA for data quality”

EDA is not a linear process, and the process will vary across people and projects. Some broad steps:

  1. Understand how input data sources were created
    • e.g., when working with survey data, have survey questionnaire and codebooks on hand
    • watch out for skip patterns!!!
  2. For each input data source, identify the “unit of analysis” and which combination of variables uniquely identify observations
  3. Investigate patterns in input variables
  4. Create analysis variable from input variable(s)
  5. Verify that analysis variable is created correctly through descriptive statistics that compare values of input variable(s) against values of the analysis variable

Guidelines for “EDA for data quality”

  • It is critically important to step through EDA processes at multiple points during data cleaning, from the input / raw data to the output / analysis / clean data.

  • Always be aware of missing values

  • They will not always be coded as NA in input variables (e.g., some projects code them as 99, 99999, negative values, etc.)
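For instance, here is a minimal sketch with toy data showing how to recode sentinel values to NA before analysis (the codes 99 and -9, and the variable names, are hypothetical):

```r
library(dplyr)

# Toy input data; 99 and -9 are hypothetical project-specific missing codes
raw <- tibble(
  id  = 1:4,
  age = c(21, 99, 34, -9)
)

# Recode sentinel values to NA before any analysis
clean <- raw %>%
  mutate(age = if_else(age %in% c(99, -9), NA_real_, age))

summary(clean$age)  # min/max should now be plausible, with 2 NA's
```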

“Unit of analysis” and which variables uniquely identify observations

“Unit of analysis” refers to “what does each observation represent” in an input data source

  • If each obs represents a trial in an experiment, you have “trial level data”
  • If each obs represents a participant, you have “participant level data”
  • If each obs represents a sample, you have “sample-level data”
  • If each obs represents a year, you have “year level data” (i.e. longitudinal)

“Unit of analysis” and which variables uniquely identify observations

How to identify unit of analysis

  • data documentation

  • investigating the data set

  • This is very important because we often conduct analyses that span multiple units of analysis (e.g., between- v within-person, person- v stimuli-level, etc.)

  • We have to be careful and thoughtful about identifiers that let us do that (important for joining data together, which will be the focus of our R workshop today)
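A quick way to check which variables uniquely identify observations is to count rows per candidate key; any key combination appearing more than once is a problem. A minimal sketch with toy data (SID and wave are hypothetical identifiers):

```r
library(dplyr)

# Toy longitudinal data; SID + wave is the candidate unique key
d <- tibble(
  SID   = c("a", "a", "b", "b", "b"),
  wave  = c(1, 2, 1, 2, 2),   # note: person "b" has wave 2 twice
  score = c(5, 6, 3, 4, 4)
)

# Any rows returned here mean the key does NOT uniquely identify observations
dupes <- d %>%
  count(SID, wave) %>%
  filter(n > 1)
dupes
```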

Rules for creating new variables

Rules I follow for variable creation

  1. Never modify input variables; instead, create new variables based on input variable(s)
    • Always keep the input variables used to create new variables
  2. Investigate input variable(s) and the relationships between input variables
  3. Develop a plan for creating the analysis variable
    • e.g., for each possible value of the input variables, what should the value of the analysis variable be?
  4. Write code to create analysis variable
  5. Run descriptive checks to verify new variables are constructed correctly
    • Can “comment out” these checks, but don’t delete them
  6. Document new variables with notes and labels
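A minimal sketch of rules 1, 4, and 5 with a hypothetical input variable (educ_yrs) and analysis variable (college):

```r
library(dplyr)

# Hypothetical input variable: years of education
d <- tibble(educ_yrs = c(10, 12, 14, 16, 18, NA))

# Rule 1: keep the input variable; create a new analysis variable from it
d <- d %>%
  mutate(college = case_when(
    is.na(educ_yrs) ~ NA_character_,
    educ_yrs >= 16  ~ "college",
    TRUE            ~ "no college"
  ))

# Rule 5: descriptive check comparing input values against the new variable
d %>% count(college, educ_yrs)
```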

DESCRIPTIVES

EDA

  • One-way descriptive analyses (i.e., focus on one variable)
    • Descriptive analyses for continuous variables
    • Descriptive analyses for discrete/categorical variables
  • Two-way descriptive analyses (relationship between two variables)
    • Categorical by categorical
    • Categorical by continuous
    • Continuous by continuous

One-way descriptive analyses

One-way descriptive analyses: Continuous
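A minimal sketch of one-way descriptives for a continuous variable, using a toy variable x in place of real input data:

```r
library(dplyr)

# Toy continuous variable (a stand-in for something like age)
d <- tibble(x = c(16, 18, 17, 21, 19, NA))

desc <- d %>%
  summarise(
    n    = sum(!is.na(x)),
    mean = mean(x, na.rm = TRUE),
    sd   = sd(x, na.rm = TRUE),
    min  = min(x, na.rm = TRUE),
    max  = max(x, na.rm = TRUE)
  )
desc
```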

One-way descriptive analyses: Categorical / Count
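For a categorical variable, a minimal sketch is a frequency table with percentages via count() (the gender codes here are toy values):

```r
library(dplyr)

# Toy categorical variable (hypothetical gender codes)
d <- tibble(gender = c(1, 2, 2, 1, 2, 2))

counts <- d %>%
  count(gender) %>%
  mutate(pct = n / sum(n) * 100)
counts
```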

Two-way descriptive analyses

Two-way descriptive analyses: Categorical x Categorical
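A categorical-by-categorical check is just a cross-tabulation; a minimal sketch with toy variables:

```r
library(dplyr)

# Toy data: two categorical variables (both hypothetical)
d <- tibble(
  gender  = c(1, 2, 2, 1, 2, 2),
  college = c("yes", "no", "yes", "yes", "no", "no")
)

# Counts for each observed combination of the two variables
tab <- d %>% count(gender, college)
tab
```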

Two-way descriptive analyses: Categorical x Continuous
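For categorical-by-continuous, summarize the continuous variable within each category; a minimal sketch with toy data:

```r
library(dplyr)

# Toy data: a grouping variable and a continuous outcome
d <- tibble(
  group = c("a", "a", "b", "b", "b"),
  x     = c(2, 4, 3, 5, 4)
)

# Summaries of the continuous variable within each category
res <- d %>%
  group_by(group) %>%
  summarise(n = n(), mean_x = mean(x), sd_x = sd(x))
res
```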

Two-way descriptive analyses: Continuous x Continuous
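For continuous-by-continuous, a correlation (and, in practice, a scatterplot) is the basic check; a minimal sketch with toy vectors:

```r
# Toy continuous variables
x <- c(1.2, 2.4, 3.1, 2.8, 1.9, 2.2)
y <- c(2.0, 3.5, 4.1, 3.9, 2.5, 3.0)

# Pairwise correlation; use = "pairwise.complete.obs" guards against NAs
cor(x, y, use = "pairwise.complete.obs")
```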


Data Wrangling in tidyr

tidyr

  • Now, let’s build off what we learned from dplyr and focus on reshaping and merging our data.
  • First, the reshapers:
  1. pivot_longer(), which takes a “wide” format data frame and makes it long.
  2. pivot_wider(), which takes a “long” format data frame and makes it wide.

tidyr

  • Next, the mergers:
  1. full_join(), which merges all rows in either data frame
  2. inner_join(), which merges rows whose keys are present in both data frames
  3. left_join(), which “prioritizes” the first data set
  4. right_join(), which “prioritizes” the second data set

(See also: anti_join() and semi_join())
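The two filtering joins keep or drop rows of the first data frame based on key matches in the second, without adding columns; a brief sketch with toy data frames (surveys and consented are hypothetical names):

```r
library(dplyr)

# Toy data frames (names are hypothetical)
surveys   <- tibble(SID = c("a", "b", "c"), score = c(5, 3, 4))
consented <- tibble(SID = c("a", "c"))

# semi_join(): rows of surveys whose SID appears in consented
kept <- semi_join(surveys, consented, by = "SID")

# anti_join(): rows of surveys whose SID does NOT appear in consented
dropped <- anti_join(surveys, consented, by = "SID")
```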

Key tidyr Functions

1. pivot_longer()

  • (Formerly gather()) Makes wide data long, based on a key
  • Core arguments:
    • data: the data, blank if piped
    • cols: columns to be made long, selected via select() calls
    • names_to: name(s) of key column(s) in new long data frame (string or string vector)
    • values_to: name of values in new long data frame (string)
    • names_sep: separator in column headers, if multiple keys
    • values_drop_na: drop missing cells (similar to na.rm = T)

1. pivot_longer(): Basic Application

Let’s start with an easy one – one key, one value:

bfi %>%
  rownames_to_column("SID") %>%
  pivot_longer(
    cols = A1:O5
    , names_to = "item"
    , values_to = "values"
    , values_drop_na = T
  ) %>%
  print(n = 8)
# A tibble: 69,492 × 6
  SID   gender education   age item  values
  <chr>  <int>     <int> <int> <chr>  <int>
1 61617      1        NA    16 A1         2
2 61617      1        NA    16 A2         4
3 61617      1        NA    16 A3         3
4 61617      1        NA    16 A4         4
5 61617      1        NA    16 A5         4
6 61617      1        NA    16 C1         2
7 61617      1        NA    16 C2         3
8 61617      1        NA    16 C3         3
# ℹ 69,484 more rows

1. pivot_longer(): More Advanced Application

Now a harder one – two keys, one value:

bfi %>%
  rownames_to_column("SID") %>%
  pivot_longer(
    cols = A1:O5
    , names_to = c("trait", "item_num")
    , names_sep = -1
    , values_to = "values"
    , values_drop_na = T
  ) %>%
  print(n = 8)
# A tibble: 69,492 × 7
  SID   gender education   age trait item_num values
  <chr>  <int>     <int> <int> <chr> <chr>     <int>
1 61617      1        NA    16 A     1             2
2 61617      1        NA    16 A     2             4
3 61617      1        NA    16 A     3             3
4 61617      1        NA    16 A     4             4
5 61617      1        NA    16 A     5             4
6 61617      1        NA    16 C     1             2
7 61617      1        NA    16 C     2             3
8 61617      1        NA    16 C     3             3
# ℹ 69,484 more rows

2. pivot_wider()

  • (Formerly spread()) Makes long data wide, based on a key
  • Core arguments:
    • data: the data, blank if piped
    • names_from: column(s) whose values become the new column names (string or string vector)
    • names_sep: separator used in the new column headers, if multiple keys
    • names_glue: a glue specification for building custom column names from multiple keys
    • values_from: column(s) supplying the cell values for the new columns (string)
    • values_fn: function applied when multiple values map to the same cell

2. pivot_wider(): Basic Application

bfi_long %>%
  pivot_wider(
    names_from = "item"
    , values_from = "values"
  )
# A tibble: 2,800 × 29
   SID   gender education   age    A1    A2    A3    A4    A5    C1    C2    C3
   <chr>  <int>     <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
 1 61617      1        NA    16     2     4     3     4     4     2     3     3
 2 61618      2        NA    18     2     4     5     2     5     5     4     4
 3 61620      2        NA    17     5     4     5     4     4     4     5     4
 4 61621      2        NA    17     4     4     6     5     5     4     4     3
 5 61622      1        NA    17     2     3     3     4     5     4     4     5
 6 61623      2         3    21     6     6     5     6     5     6     6     6
 7 61624      1        NA    18     2     5     5     3     5     5     4     4
 8 61629      1         2    19     4     3     1     5     1     3     2     4
 9 61630      1         1    19     4     3     6     3     3     6     6     3
10 61633      2        NA    17     2     5     6     6     5     6     5     6
# ℹ 2,790 more rows
# ℹ 17 more variables: C4 <int>, C5 <int>, E1 <int>, E2 <int>, E3 <int>,
#   E4 <int>, E5 <int>, N1 <int>, N2 <int>, N3 <int>, N4 <int>, N5 <int>,
#   O1 <int>, O2 <int>, O3 <int>, O4 <int>, O5 <int>

2. pivot_wider(): More Advanced

bfi_long %>%
  pivot_wider(
    names_from = c("trait", "item_num")
    , values_from = "values"
    , names_sep = "_"
  )
# A tibble: 2,800 × 29
   SID   gender education   age   A_1   A_2   A_3   A_4   A_5   C_1   C_2   C_3
   <chr>  <int>     <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
 1 61617      1        NA    16     2     4     3     4     4     2     3     3
 2 61618      2        NA    18     2     4     5     2     5     5     4     4
 3 61620      2        NA    17     5     4     5     4     4     4     5     4
 4 61621      2        NA    17     4     4     6     5     5     4     4     3
 5 61622      1        NA    17     2     3     3     4     5     4     4     5
 6 61623      2         3    21     6     6     5     6     5     6     6     6
 7 61624      1        NA    18     2     5     5     3     5     5     4     4
 8 61629      1         2    19     4     3     1     5     1     3     2     4
 9 61630      1         1    19     4     3     6     3     3     6     6     3
10 61633      2        NA    17     2     5     6     6     5     6     5     6
# ℹ 2,790 more rows
# ℹ 17 more variables: C_4 <int>, C_5 <int>, E_1 <int>, E_2 <int>, E_3 <int>,
#   E_4 <int>, E_5 <int>, N_1 <int>, N_2 <int>, N_3 <int>, N_4 <int>,
#   N_5 <int>, O_1 <int>, O_2 <int>, O_3 <int>, O_4 <int>, O_5 <int>

2. pivot_wider(): A Little More Advanced

bfi_long %>%
  select(-item_num) %>%
  pivot_wider(
    names_from = "trait"
    , values_from = "values"
    , names_sep = "_"
    , values_fn = mean
  )
# A tibble: 2,800 × 9
   SID   gender education   age     A     C     E     N     O
   <chr>  <int>     <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 61617      1        NA    16   3.4   3.2  3.4    2.8   3.8
 2 61618      2        NA    18   3.6   4    3      3.8   3.2
 3 61620      2        NA    17   4.4   4    3.8    3.6   3.6
 4 61621      2        NA    17   4.8   4.2  4      2.8   3.6
 5 61622      1        NA    17   3.4   3.6  3.6    3.2   3.2
 6 61623      2         3    21   5.6   4.4  4      3     3.8
 7 61624      1        NA    18   4     3.6  4.2    1.4   3.8
 8 61629      1         2    19   2.8   3    3.2    4.2   3.4
 9 61630      1         1    19   3.8   4.8  3.75   3.6   5  
10 61633      2        NA    17   4.8   4    3.6    4.2   3.6
# ℹ 2,790 more rows

More dplyr Functions

The _join() Functions

  • Often we need to combine data from different sources
  • There are lots of reasons to need to do this
  • We don’t have time to get into all the use cases here, so we’ll talk about them at a high level
  • We’ll focus on:
    • full_join()
    • inner_join()
    • left_join()
    • right_join()

The _join() Functions

  • Let’s separate demographic and BFI data
bfi_only <- bfi %>% 
  rownames_to_column("SID") %>%
  select(SID, matches("[0-9]"))
bfi_only %>% as_tibble() %>% print(n = 6)
# A tibble: 2,800 × 26
  SID      A1    A2    A3    A4    A5    C1    C2    C3    C4    C5    E1    E2
  <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 61617     2     4     3     4     4     2     3     3     4     4     3     3
2 61618     2     4     5     2     5     5     4     4     3     4     1     1
3 61620     5     4     5     4     4     4     5     4     2     5     2     4
4 61621     4     4     6     5     5     4     4     3     5     5     5     3
5 61622     2     3     3     4     5     4     4     5     3     2     2     2
6 61623     6     6     5     6     5     6     6     6     1     3     2     1
# ℹ 2,794 more rows
# ℹ 13 more variables: E3 <int>, E4 <int>, E5 <int>, N1 <int>, N2 <int>,
#   N3 <int>, N4 <int>, N5 <int>, O1 <int>, O2 <int>, O3 <int>, O4 <int>,
#   O5 <int>
bfi_dem <- bfi %>%
  rownames_to_column("SID") %>%
  select(SID, education, gender, age)
bfi_dem %>% as_tibble() %>% print(n = 6)
# A tibble: 2,800 × 4
  SID   education gender   age
  <chr>     <int>  <int> <int>
1 61617        NA      1    16
2 61618        NA      2    18
3 61620        NA      2    17
4 61621        NA      2    17
5 61622        NA      1    17
6 61623         3      2    21
# ℹ 2,794 more rows

3. full_join()

Most simply, we can put those back together keeping all observations.

bfi_only %>%
  full_join(bfi_dem) %>%
  as_tibble() %>%
  print(n = 6)
# A tibble: 2,800 × 29
  SID      A1    A2    A3    A4    A5    C1    C2    C3    C4    C5    E1    E2
  <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 61617     2     4     3     4     4     2     3     3     4     4     3     3
2 61618     2     4     5     2     5     5     4     4     3     4     1     1
3 61620     5     4     5     4     4     4     5     4     2     5     2     4
4 61621     4     4     6     5     5     4     4     3     5     5     5     3
5 61622     2     3     3     4     5     4     4     5     3     2     2     2
6 61623     6     6     5     6     5     6     6     6     1     3     2     1
# ℹ 2,794 more rows
# ℹ 16 more variables: E3 <int>, E4 <int>, E5 <int>, N1 <int>, N2 <int>,
#   N3 <int>, N4 <int>, N5 <int>, O1 <int>, O2 <int>, O3 <int>, O4 <int>,
#   O5 <int>, education <int>, gender <int>, age <int>
bfi %>%
  rownames_to_column("SID") %>%
  as_tibble() %>%
  print(n = 6)
# A tibble: 2,800 × 29
  SID      A1    A2    A3    A4    A5    C1    C2    C3    C4    C5    E1    E2
  <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 61617     2     4     3     4     4     2     3     3     4     4     3     3
2 61618     2     4     5     2     5     5     4     4     3     4     1     1
3 61620     5     4     5     4     4     4     5     4     2     5     2     4
4 61621     4     4     6     5     5     4     4     3     5     5     5     3
5 61622     2     3     3     4     5     4     4     5     3     2     2     2
6 61623     6     6     5     6     5     6     6     6     1     3     2     1
# ℹ 2,794 more rows
# ℹ 16 more variables: E3 <int>, E4 <int>, E5 <int>, N1 <int>, N2 <int>,
#   N3 <int>, N4 <int>, N5 <int>, O1 <int>, O2 <int>, O3 <int>, O4 <int>,
#   O5 <int>, gender <int>, education <int>, age <int>

4. inner_join()

We can also keep all rows present in both data frames

bfi_dem %>%
  filter(row_number() %in% 1:1700) %>%
  inner_join(
    bfi_only %>%
      filter(row_number() %in% 1200:2800)
  ) %>%
  as_tibble() %>%
  print(n = 6)
# A tibble: 501 × 29
  SID   education gender   age    A1    A2    A3    A4    A5    C1    C2    C3
  <chr>     <int>  <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 64151         3      2    18     1     5     6     5     5     5     6     5
2 64152         4      2    29     1     5     6     5     5     2     1     4
3 64154         5      1    46     2     5     6     5     6     6     6     6
4 64155         5      1    58     5     4     4     4     5     4     4     5
5 64156         5      2    38     1     4     6     6     6     4     4     5
6 64158         5      2    27     2     3     1     1     1     4     2     2
# ℹ 495 more rows
# ℹ 17 more variables: C4 <int>, C5 <int>, E1 <int>, E2 <int>, E3 <int>,
#   E4 <int>, E5 <int>, N1 <int>, N2 <int>, N3 <int>, N4 <int>, N5 <int>,
#   O1 <int>, O2 <int>, O3 <int>, O4 <int>, O5 <int>
bfi %>%
  rownames_to_column("SID") %>%
  as_tibble() %>%
  print(n = 6)
# A tibble: 2,800 × 29
  SID      A1    A2    A3    A4    A5    C1    C2    C3    C4    C5    E1    E2
  <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 61617     2     4     3     4     4     2     3     3     4     4     3     3
2 61618     2     4     5     2     5     5     4     4     3     4     1     1
3 61620     5     4     5     4     4     4     5     4     2     5     2     4
4 61621     4     4     6     5     5     4     4     3     5     5     5     3
5 61622     2     3     3     4     5     4     4     5     3     2     2     2
6 61623     6     6     5     6     5     6     6     6     1     3     2     1
# ℹ 2,794 more rows
# ℹ 16 more variables: E3 <int>, E4 <int>, E5 <int>, N1 <int>, N2 <int>,
#   N3 <int>, N4 <int>, N5 <int>, O1 <int>, O2 <int>, O3 <int>, O4 <int>,
#   O5 <int>, gender <int>, education <int>, age <int>

5. left_join()

Or all rows present in the left (first) data frame, perhaps if it’s a subset of people with complete data

bfi_dem %>%
  drop_na() %>%
  left_join(bfi_only) %>%
  as_tibble() %>%
  print(n = 6)
# A tibble: 2,577 × 29
  SID   education gender   age    A1    A2    A3    A4    A5    C1    C2    C3
  <chr>     <int>  <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 61623         3      2    21     6     6     5     6     5     6     6     6
2 61629         2      1    19     4     3     1     5     1     3     2     4
3 61630         1      1    19     4     3     6     3     3     6     6     3
4 61634         1      1    21     4     4     5     6     5     4     3     5
5 61640         1      1    17     4     5     2     2     1     5     5     5
6 61661         5      1    68     1     5     6     5     6     4     3     2
# ℹ 2,571 more rows
# ℹ 17 more variables: C4 <int>, C5 <int>, E1 <int>, E2 <int>, E3 <int>,
#   E4 <int>, E5 <int>, N1 <int>, N2 <int>, N3 <int>, N4 <int>, N5 <int>,
#   O1 <int>, O2 <int>, O3 <int>, O4 <int>, O5 <int>
bfi %>%
  rownames_to_column("SID") %>%
  as_tibble() %>%
  print(n = 6)
# A tibble: 2,800 × 29
  SID      A1    A2    A3    A4    A5    C1    C2    C3    C4    C5    E1    E2
  <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 61617     2     4     3     4     4     2     3     3     4     4     3     3
2 61618     2     4     5     2     5     5     4     4     3     4     1     1
3 61620     5     4     5     4     4     4     5     4     2     5     2     4
4 61621     4     4     6     5     5     4     4     3     5     5     5     3
5 61622     2     3     3     4     5     4     4     5     3     2     2     2
6 61623     6     6     5     6     5     6     6     6     1     3     2     1
# ℹ 2,794 more rows
# ℹ 16 more variables: E3 <int>, E4 <int>, E5 <int>, N1 <int>, N2 <int>,
#   N3 <int>, N4 <int>, N5 <int>, O1 <int>, O2 <int>, O3 <int>, O4 <int>,
#   O5 <int>, gender <int>, education <int>, age <int>

6. right_join()

Or all rows present in the right (second) data frame, such as I do when I join a codebook with raw data

bfi_dem %>%
  drop_na() %>%
  right_join(bfi_only) %>%
  as_tibble() %>%
  print(n = 6)
# A tibble: 2,800 × 29
  SID   education gender   age    A1    A2    A3    A4    A5    C1    C2    C3
  <chr>     <int>  <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 61623         3      2    21     6     6     5     6     5     6     6     6
2 61629         2      1    19     4     3     1     5     1     3     2     4
3 61630         1      1    19     4     3     6     3     3     6     6     3
4 61634         1      1    21     4     4     5     6     5     4     3     5
5 61640         1      1    17     4     5     2     2     1     5     5     5
6 61661         5      1    68     1     5     6     5     6     4     3     2
# ℹ 2,794 more rows
# ℹ 17 more variables: C4 <int>, C5 <int>, E1 <int>, E2 <int>, E3 <int>,
#   E4 <int>, E5 <int>, N1 <int>, N2 <int>, N3 <int>, N4 <int>, N5 <int>,
#   O1 <int>, O2 <int>, O3 <int>, O4 <int>, O5 <int>
bfi %>%
  rownames_to_column("SID") %>%
  as_tibble() %>%
  print(n = 6)
# A tibble: 2,800 × 29
  SID      A1    A2    A3    A4    A5    C1    C2    C3    C4    C5    E1    E2
  <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 61617     2     4     3     4     4     2     3     3     4     4     3     3
2 61618     2     4     5     2     5     5     4     4     3     4     1     1
3 61620     5     4     5     4     4     4     5     4     2     5     2     4
4 61621     4     4     6     5     5     4     4     3     5     5     5     3
5 61622     2     3     3     4     5     4     4     5     3     2     2     2
6 61623     6     6     5     6     5     6     6     6     1     3     2     1
# ℹ 2,794 more rows
# ℹ 16 more variables: E3 <int>, E4 <int>, E5 <int>, N1 <int>, N2 <int>,
#   N3 <int>, N4 <int>, N5 <int>, O1 <int>, O2 <int>, O3 <int>, O4 <int>,
#   O5 <int>, gender <int>, education <int>, age <int>

Attributions

Part 1 of these slides was adapted from